Volume 110, Number 6, November-December 2005 

Journal of Research of the National Institute of Standards and Technology 

[J. Res. Natl. Inst. Stand. Technol. 110, 605-612 (2005)] 

Bayesian Approach to Assessing Uncertainty 

and Calculating a Reference Value in Key 

Comparison Experiments 



Volume 110 



Number 6 



November-December 2005 



Blaza Toman 

National Institute of Standards 
and Technology, 
Gaithersburg, MD 20899-8980 

toman@nist.gov 



International experiments called Key 
Comparisons pose an interesting statistical 
problem, the estimation of a quantity 
called a Reference Value. There are many 
possible forms that this estimator can take. 
Recently, this topic has received much 
international attention. In this paper, it is 
argued that a fully Bayesian approach to 
this problem is compatible with the current 
practice of metrology, and can easily be 
used to create statistical models which 
satisfy the varied properties and 
assumptions of these experiments. 
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1. Introduction 

In October of 1999, the directors of the national 
metrology institutes of 38 of the member states of the 
Metre Convention, signed a document called the Mutual 
Recognition Agreement (MRA) [1] dealing with nation- 
al measurement standards. The objectives of the MRA 
are the establishment of degree of equivalence 
of national measurement standards maintained by 
the national metrology institutes, the recognition of 
calibration and measurement services provided by the 
institutes and consequently the establishment of a secure 
technical foundation for wider agreements related 
to international trade. The process of achieving these 
objectives is through international comparisons of 
measurements called Key Comparisons. The overall 
coordination of the Key Comparisons is through the 
International Bureau for Weights and Measures (BIPM) 
under the authority of the International Committee 
for Weights and Measures (CIPM). Details of the 
agreements can be found on the BIPM website, 
http://www.bipm.fr. 



There are general guidelines, given in Ref. [2], for the 
execution of the Key Comparison experiments, dealing 
with the selection of what is to be measured, who is to 
participate and how the results are to be disseminated. 
There are now Key Comparison experiments being per- 
formed in a great variety of disciplines. Some examples 
are temperature measurements, gas flow measurements, 
responsivity of photodiodes to various wavelengths of 
light, sound measurements and many more. Two classes 
of Key Comparisons have been defined in Refs. [3, 4]. A 
Class 1 Comparison is one where each participant meas- 
ures a local standard, or possibly a traveling artifact. In 
such an experiment there may be a single quantity that is 
being measured (measurand ) if the artifact is stable, but 
if not, there may be large systematic laboratory effects 
which need to be accounted for in the statistical model 
and analysis. Currently, most Key Comparisons belong 
to Class 1 and an example is given in Sec. 3. A Class 2 
Comparison is one where all laboratories take measure- 
ments of a single physical state or property. Thus in a 
Class 2 experiment, there is a clearly defined single 
measurand. 
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In either type of comparison, the laboratories per- 
form their measurements and report their results back 
to a pilot laboratory. The results from each laboratory 
consist of an average of a set of measurements (often 
with a small sample size) of the quantity being meas- 
ured and the total uncertainty in the measurements. The 
calculation of the uncertainty is prescribed in the Guide 
to the Expression of Uncertainty in Measurement [5], a 
publication of the ISO. The pilot laboratory assembles 
all of the data and calculates various measures of agree- 
ment among the participating laboratories. The exact 
form of the analysis is not prescribed, and there is con- 
siderable debate in the metrology community about this 
issue. However, it is stated in Ref. [2] that the pilot lab- 
oratory writes a report on the key comparison that should 
include "proposals for a reference value," called the Key 
Comparison Reference Value (KCRV). The KCRV is 
said to be "usually a close approximation to the corre- 
sponding SI value." In a Class 2 Key Comparison, the 
KCRV is clearly an estimate of the measurand. Class 1 
Key Comparisons may have multiple measurands and 
so the interpretation of the KCRV is not straightfor- 
ward. To date, there have been numerous proposals for 
the calculation of the KCRV, The articles given in Refs, 
[6 - 10] present some of the methods and more can be 
found in their references. The published methods have 
so far been based mostly on frequentist statistical mod- 
els, even though, the Guide uses belief-based definition 
of probability, and so can be said to be more compati- 
ble with Bayesian statistical methods. 

This paper proposes fully Bayesian models for Key 
Comparison data analysis. It is shown that their 
assumptions are compatible with the approach of the 
Guide. The models allow for a unified approach to the 
analysis of Class 1 Key Comparisons, whether or not 
systematic laboratory effects are present. In Sec. 2, the 
Guide s definition of measurement uncertainty is fully 
described and a compatible Bayesian mathematical 
model for measurement is presented. Sec. 3 contains an 
example of a Key Comparison experiment. Sec. 4 gives 
the statistical models for Key Comparison data. Sec. 5 
presents the analysis for Key Comparisons with a 
single and with multiple measurands. Sec. 6 presents 
the analysis of the example Key Comparison. 
Conclusions follow in Sec. 7. 

2. The Measurement Model 

The approach to quantification of uncertainty in 
measurement, which is now widely used in the physical 
sciences, is that presented in the Guide to the 



Expression of Uncertainty in Measurement. The basic 
idea of the Guide is to approximate a measurement 
equation 



H = g(0 1 ,...,O p ), 



(1) 



where g is a known function, // denotes the measurand 
and Q h ...,Q p denote p input quantities (random vari- 
ables), by a first order Taylor series about the expected 
values of the ( s. The uncertainty in the measurand, 
denoted u(fi), is then defined as the standard deviation 
of the probability distribution of // based on this linear 
approximation, that is 



"00 = J5>? var (3 ) +2 I W covft, Oj ), ( 2 ) 

V 

where the c t are the partial derivatives of// with respect 
to b var denotes the variance, and cov denotes the 
covariance. The Guide uses an interpretation of proba- 
bility consistent with the Bayesian paradigm, that is, 
probability distributions of the ( and // summarize our 
knowledge about these quantities. The expected values 
and variances of the 9 t may be based on actual physical 
measurements or on other information such as expert 
opinion. The Guide defines two types of uncertainty 
evaluations. Type A is "by the statistical analysis of 
series of observations" and this has usually been inter- 
preted as "using sample standard deviations." Type B is 
by "other means" and this has usually meant using 
manufacturer specifications, expert knowledge or even 
data from additional experiments. An example of a 
Type B evaluated uncertainty is the uncertainty in the 
internal volume of a 100 ml flask being used in a chem- 
istry experiment. Here, the manufacturer may give a 
volume of 100 ml ± 0. 1 ml. This could be interpreted as 
the volume having a rectangular distribution on the 
interval (99.9 ml, 100.1 ml), that is, having a standard 
deviation of 0.058. A key idea is that the data from the 
present experiment is not informative about sources of 
uncertainty evaluated by Type B methods. Such uncer- 
tainty is due to systematic effects that influence all of 
the observations in the experiment, such as for exam- 
ple, a flask not really having a volume of 100 ml. 

The most common analysis of a metrology experi- 
ment estimates the expected value of// by j;, the output 
of the measurement equation 



y 



= g(x 1 ,...x r ,X 1 ,...,X s ) 



(3) 



where p = r+s, r of the input quantities having physi- 
cal measurements and s of the input quantities being 



606 



Volume 110, Number 6, November-December 2005 

Journal of Research of the National Institute of Standards and Technology 



based on other information. The x- t represent sample 
means of the measurements used to estimate b X { rep- 
resent the subjective evaluations of the means of the 
remaining input quantities. The uiji) is approximated as 
uiy) as follows. The variances of the frequency-based 
distributions of the r sample means are usually used for 
the corresponding variances in Eq. (2). Subjective eval- 
uations are used for the variances of the remaining 0/s. 
The usual interpretation of the y and the uiy) is as the 
mean and standard deviation of a probability distribu- 
tion of the measurand. From a statistical perspective, 
this usage represents methodology that is neither total- 
ly frequentist nor totally Bayesian, but can be viewed 
as an approximate solution to a Bayesian inference 
problem. For further discussion of this subject see Refs. 
[11, 12]. 

In the analysis of Key Comparison experiments, the 
pilot laboratory receives the values y and the uncertain- 
ties uiy) from all of the participants. Generally the par- 
ticipants can also provide an estimate of the repeatabil- 
ity component of the uncertainty. Even though, it is 
understood that y and uiy) are features of a probability 
distribution for a measurand, no information about the 
shape of this distribution is provided. For a single lab- 
oratory, the following Bayesian statistical model gives 
results consistent with the Guide s measurement model, 
and has the added benefit of being easily extended to 
accomplish the KCRV estimation 



f(fi\y)-- 



Y\G,<7 2 ~ N(G,o 2 ) 
0\^t 2 ~N(ii,<c 2 ) 
jl\m,(0 2 ~ N(rn,G) 2 ). 



(4) 



The notation Y \ 0,<J 2 ~f(Y\0,<J 2 ) = N(0,<J 2 ) rep- 
resents conditioning. That is, the probability distribu- 
tion of 7, given and (7 2 , is f(Y\6,<7 2 ) which is a 
Gaussian (Normal) distribution with mean and vari- 
ance a 2 . The participant's inputs are a sample mean y, 
a sample standard deviation s 2 (an estimate of o 2 ) and 
the remaining uncertainty t 2 . In this model, stage one 
in the hierarchy is used to quantify the usual sampling 
variability of y, that is, the uncertainty component due 
to repeatability. Stage two represents the remaining 
uncertainty, both that evaluated by Type A and Type B 
methods. Stage three is a prior distribution of the 
measurand jx. Normal distributions are used in this 
article but other forms of probability distributions can 
easily be substituted when appropriate. Generally, a 
non-informative stage three prior distribution on jx 
would be used, that is, allow (& —> °°. Application of 
Bayes Theorem 



f(y\e,a 2 )f(e\n,r 2 )f(n) 

jf(y\e,<T 2 )f(e\n,r 2 )f(n)dn 



leads to a posterior distribution for jx 

jl\y^ 2 ,o 2 ~N(y^ 2 +o 2 ). (5) 

In the above model the variance a 2 is assumed to be 
a known quantity. As this is generally not true, a 2 
would be estimated by the sample variance s 2 . When t 2 
dominates a 2 , as is often the case in high precision 
physical measurements, or when the sample size on 
which s 2 is based is large, the posterior distribution of 
jx can be well approximated by 



fi\y,T 2 ~N(y 9 i? + 



n 



(6) 



(When the relationship between t 2 and o 2 is less 
extreme, it is better to assign a 2 a non-informative prior 
distribution and obtain the posterior distribution of ji by 
Markov Chain Monte Carlo methods. This will be 
shown in an example). 

Thus the approximate posterior mean and posterior 
standard deviation arising from Eq. (4) are in fact the 
quantities recommended for estimation by the Guide. 
In the next section, an example of a Key Comparison 
experiment is described in order to motivate the 
proposed modeling and analysis. 

3, Example of a Key Comparison 
Experiments,Vibration 
Acceleration (CCAUV.V-K1) 

The aim of this experiment in the area of vibration 
and shock measurement was to compare measurements 
of sinusoidal linear accelerations in the frequency range 
from 40 Hz to 5 kHz. During the period from January 
2000 to June 2001, 12 national metrology institutes 
used two different accelerometers, one of single-ended 
design and one of back-to-back design, to measure 
charge sensitivity at different frequencies. The charge 
sensitivity was given in pico coulomb per meters per 
second squared [pC/(m/s 2 )]. All laboratories followed 
the same measurement protocol, controlling tempera- 
ture and relative humidity and other external variables 
which could affect the measurements. The German 
institute Physikalisch Technische Bundesanstalt (PTB) 
was the pilot laboratory, responsible for checking the 
long term stability of the accelerometers. These were 
hand-carried from the pilot laboratory to the participat- 
ing institutes in a closed box by representatives of the 
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various institutes. The data from the Key Comparison 
was published in Report on Key Comparison 
CCAUV.V-K1 [13] and is now available for further 
study This is a Class 1 Key Comparison with a travel- 
ing artifact, which can be considered as having a single 
measurand. 

A subset of this data, charge sensitivity for the 
single-ended accelerometer at 40 Hz , is given here 
in Table 1. The table contains the mean values (j,), 
the repeatability component of the uncertainty (s t ) 
and the Type B evaluated uncertainties (r,-) for this 
measurement. 

The mean measurements are averages over five 
measurements for all but three laboratories. Laboratory 
one took nine measurements, laboratory eight took 
three measurements, and laboratory nine took four 
measurements. The s t are the sample standard devia- 
tions of the means. Each laboratory calculated their 
uncertainty values by the usual error propagation tech- 
niques, see Ref. [5], and included terms such as uncer- 
tainty from possible voltage disturbances, phase noise, 
uncertainty in the vibration frequency measurement 
and others. All of the possible sources of uncertainty 
that were to be considered are described in the publica- 
tion [14]. Each laboratory assessed their Type B uncer- 
tainty independently of the other laboratories and with- 
out knowledge of the other laboratories' measurements. 

4. Two Models for Key Comparison Data 
4.1 Multiple Means Model 

A Key Comparison experiment is a multi-laboratory 
study. If we treat all laboratories data totally independ- 
ently from each other, that is, if we assume that there 
are no relationships between the measurands or the 
uncertainties of the various laboratories, we can extend 

Table 1. Charge sensitivity and the associated uncertainties 



the statistical model given in Eq. (4) as follows. 
For / = 1,. . ., k 9 where k is the number of laboratories, 
we have 



Y t \0 t ,o, 2 ~ N(0 t ,o t 2 ) 
ju,. |iw,.,ffl 2 ~ Nim^co 2 ) 



(7) 



The posterior distributions of the jU, can be approximat- 
ed by 



H i \y i ,T 2 ~N{y i ,T 2 + s 2 ), 



(8) 



and so the standard uncertainty of each laboratory is 
approximately the Guide recommended quantity. 

4.2 Single Mean Experiment 

Suppose now that there is a common measurand as 
would be true in all Class 2 Key Comparisons and 
in most Class 1 Key Comparisons. Equation (7) can be 
modified to reflect this fact: 



Y^o 2 ~ N(9 n a 2 ) 

O t \H„x 2 ~N(n n x 2 ) 
H,\H,Y 2 ~ N(n,f) 

fi~ N(m, CO 2 ) 
Y~U(0,c). 



(9) 



The notation £/(0, c) represents a rectangular 
(Uniform) distribution on the interval (0, c), where 
c is a constant. The prior distributions on the fj, i are 
now hierarchical, the common mean ji being the 
measurand of the entire experiment. Note that y 2 



Laboratory number 



y t in pC/(m/s ) 



Vn 10 5 pC/(m/s 2 ) 



T f inl0 5 pC/(m/s 2 ) 



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 



0.12901 
0.12914 
0.12924 
0.12874 
0.12960 
0.12890 
0.12875 
0.12870 
0.12853 
0.12830 
0.12950 
0.12877 



3.6 
5.5 
8.9 
6.1 

55.9 
3.9 
2.4 
9.6 
9.9 
4.4 
5.5 

11.6 



19 

73 
57 
66 
85 
72 
43 
39 
53 
54 
43 
132 
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represents variability due to systematic laboratory 
effects. When a prior distribution for it is specified, y 2 
can be estimated from the combined data of the partic- 
ipating laboratories. Recall that for each individual lab- 
oratory, the contribution of such systematic laboratory 
effects to their uncertainty is not estimable by their own 
data and so is evaluated by Type B methods and is part 
of T;. For example, in the CCAUV.V-K1 key compari- 
son, the uncertainty in the vibration frequency meas- 
urement is of this type. So is essentially any uncertain- 
ty attributed to the individual laboratory's technique. 
Thus the uncertainty estimate of// based on Eq. (9) is 
somewhat conservative as it is impossible, without a 
complete list of all Type B evaluated uncertainties from 
each participant, to separate out the effects which are 
estimable from the pooled data and the effects which 
are not. When the a\ are considered unknown and are 
given prior distributions, analysis based on Eq. (9) must 
be done numerically as closed form solutions are not 
available. Markov Chain Monte Carlo methods can 
readily be used, see Ref. [15] for sample computer 
programs. 

A special case of Eq. (9) is when y 2 = 0, that is, when 
all laboratories are assumed to be truly measuring the 
same quantity. In such a case, allowing of — » <*> and fol- 
lowing Ref. [16], it can be shown that approximately, 
the posterior distribution of ji is normal with mean ji p 



and standard deviation (O p where 



^ =—> 



±y>tf+tT 



(10) 



XfcW) 



%=- 



x(^r 



en) 



5, Analysis 

5.1 Single Measurand Experiment 

Both Eqs. (7) and (9) can be used to construct a 
KCRV and its uncertainty for a single measurand exper- 
iment. 

First, consider using Eq. (9). In this case, the meaning 
of the KCRV is clear and so is its estimation. It is plain- 
ly an estimate of the common measurand ji and can be 
provided by the mean of the posterior distribution. The 
uncertainty is the posterior standard deviation of //. 
If y 2 is set to 0, the solution [Eq.(10)] is the most 
commonly used KCRV estimate described in Ref. [6]. 



However, unlike in that publication, here it is derived 
based on a Bayesian model. The underlying assumption 
of a common laboratory mean has been questioned in 
the literature, see for example Ref. [17]. Equation (9) 
with a prior distribution on y 2 provides a sensible alter- 
native, one that allows for systematic laboratory differ- 
ences, provides a more conservative estimate of the 
uncertainty of the KCRV and also allows for a degree 
of validation of the stated uncertainties. The publica- 
tion [17] takes another approach and directly models 
the differences between the measurand ji and the //,. 
The resulting KCRV is somewhat related to that 
described in Sec. 5. 2. 

It is possible that the participants of the Key 
Comparison may not wish to allow the mathematical 
model to pool their data automatically, but want to 
determine the form of the KCRV more directly. An 
alternate approach then would be to use the multiple 
means Eq. (7) and to construct a KCRV based on it. 
This requires a synthesis of the probability distributions 
of the fij into a single distribution for //. The literature 
on such methods is rich and has been reviewed in Refs. 
[18] and [19]. An approach that is sensible for the cur- 
rent application is the Supra-Bayesian technique given 
in Ref. [20]. This can be described as follows: 

A single person with vague prior knowledge of a 
parameter ji consults k experts who provide the means 
(in our notation y f ) and standard deviations (in our 

notation ^tf + sf ) of their probability distributions for 

ji. The person then combines the k experts' distribu- 
tions into a single probability distribution. He does this 
by first specifying a normal likelihood function to 
express his opinion about the experts' knowledge and 
then using Bayes Theorem. Namely he specifies that 
the distribution 

P(ji J ». s jjT 1 ,...T jt , ( s lJ ..., ( s jt ,ju) (12) 

is multivariate normal with means a { + /?,// , standard 

deviations k\ ^rf + sf , and correlations p yY for / = 1 , . . . , k. 

In this way he can express his beliefs about the possible 
biases (in terms of the a { and #) of the experts, about 
their precision (in terms of k* ; ), and to what extent their 
assessments are correlated or not. In the case of no 
correlation between the laboratories, the resulting 
posterior distribution for ji is normal with mean 



Mb=- 



2>V(t, 2 +*?)"' 



(13) 
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and variance 



0)n 



IAV(t, 2 +< 



(14) 



Note that when the a t are set to 0, and the /J f - and K- t are 
set equal to one, these expressions become Eqs. (10) 
and (11). Note that this is again the weighted mean 
estimator as described in [6]. The most frequent criti- 
cism of this analysis is based on the belief that the val- 
ues of the Type B uncertainties cannot be considered as 
well determined quantities but rather only as estimates 
of the underlying variability. Taking the K t not as con- 
stants but giving them a probability distribution can 
model this fact. 

For simplicity it will be assumed here that the 
laboratories' results are independent and thus that the p t 
are equal to 0. This is generally a reasonable assump- 
tion in Key Comparisons but can be relaxed if neces- 
sary. Reference [20] shows that without loss of gener- 
ality the a { may be set to and the $ % to 1 , when the 
following probability model for the K- t is employed: 



v. c z 



' AU; ' 



(15) 



Note that E(k?) = cfvJ^-2) and the coefficient 
of variation of K t is (v/2 - 2)" 1/2 . The values of c t and the 
degrees of freedom v,- therefore specify the location and 
the spread of the distribution of K t . The selection of the 
values can be aided by noting that K t is approximately 
normal with mean log c t and variance (2 V,)" 1 , which 
further implies that approximately af 1 c l < k; < a t c h for 
a l such that log a,. =^2/v~. Further discussion of these 
relationships appears in Ref. [20] and also in Ref. [21]. 
Combining the likelihood function [Eq. (12)] and the 
prior distribution [Eq. (15)] via Bayes Theorem for a 
single laboratory i results in 



(p-M 




W+* 1 



I 



(16) 



having a student t vi distribution. Because of the inde- 
pendence of the laboratories, using Bayes Theorem 
with Eqs. (12) and (15) for all k laboratories results in 
a posterior distribution of jx which is a product of the 
t V i (i = l,..., k) distributions. An interesting property of 
this distribution is that it can be multi-modal when 
there is strong disagreement among the laboratories. 



This model, in a more general context, is also discussed 
in some detail in Ref. [22] and further generalized by 
Ref. [23]. 

The KCRV can be taken to be the mean or possibly 
the median of this distribution. The uncertainty is the 
standard deviation of this distribution. Such quantities 
cannot be obtained in closed form but can easily be 
computed using standard Markov Chain Monte Carlo 
methods (Ref. [15]). Interestingly, the same distribution 
was derived using a different model and different moti- 
vation by [24] for the problem of combining data which 
appear to be in mutual disagreement. 

Both approaches, that is KCRV estimation based 
on Eq. (9) or the Supra-Bayesian KCRV based on 
Eq. (7), are reasonable. Arguably, the Supra-Bayesian 
method introduces fewer assumptions, and allows a 
more direct modeling of possible inaccuracies in the 
individual Type B evaluations. On the other hand, the 
straightforward interpretation of Eq. (9) and the 
possibility of data-based estimation of the effects of the 
systematic laboratory effects makes this approach 
appealing as well. Analysis of data from CCAUV.V-K1 
key comparison in Sec. 6 will illustrate the two 
approaches. 

5.2 Multiple Measurand Experiments 

In some Class 1 experiments there are clearly 
defined multiple measurands. In such a case, Eq. (7) 
would be used. The question then is again how to esti- 
mate the KCRV since it has no natural interpretation. 
The Supra-Bayesian solution given in Sec. 5.1 is not 
applicable here since there is no common measurand. 
One possible solution is the following method, based 
on the so-called linear opinion pool, which dates back 
to Laplace. In this method k probability distributions 
Pi ( ) are combined as 



p( )=!>#( ) 



(17) 



where the weights w t add up to one. In the present 
application, the k laboratories posterior distributions for 
jii could be combined into the mixture distribution of a 
new random variable ji, namely 



/?(A*) = 5>/ 



>/2*(*,W) 






(18) 



This is using the Gaussian density for p it In most 
cases, the weights w t would be taken to be \ik repre- 
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senting a view that the laboratories' data are of equal 
quality. The mean of this distribution, that is, the 
average of the k laboratories measurements would 
be taken as the KCRV. The standard deviation of this 
distribution 



u(y) = 



Efc 2 +*i) X(*-tf 



(19) 



Table 2. Comparison of KCRV estimates 



Common Lab-effects Supra- Linear- 
mean model model Bayes Pool 
pC/(m/s 2 ) pC/(m/s 2 ) pC/(m/s 2 ) pC/(m/s 2 ) 

KCRV 0.12892 0.12894 0.12894 0.12893 

Uncertainty 0.000128 0.000167 0.000253 0.000791 



being the standard uncertainty of the KCRV. The linear 
opinion pool is an easily understood and easily per- 
formed method. It is intuitively pleasing because the 
weights can be thought to represent the relative quality 
of the laboratories' results. The estimator, with a 
frequentist interpretation, was used in Refs. [3] and 
[25]. Its main appeal in the Bayesian context, is that 
u(y) can be thought to represent the total variability in 
the population of measurands of the Key Comparison. 
This can be viewed as the true measure of uncertainty 
in such a Key Comparison, because of the assumed 
equality of the laboratories in terms of their compe- 
tence. 

6. Example: Analysis of CCAUV.V-K1 

The Key Comparison in Vibration Acceleration is 
an interesting example of Class 1 Key Comparison 
with a traveling artifact. The data is given in Sect. 3.1. 
Table 1 summarizes the results of the analysis. Using 
Eq. (9), various results can be obtained depending 
on the value of y 2 . In Table 2, the results for y= 
are labeled "common mean model." The results using 
a uniform prior distribution on y are labeled "lab- 
effect model." The third column contains the Supra 
Bayes estimate, one with c,-= 1.0 and a 7 = 2.0, that is, 
0.5 < Ki < 2.0, (i = 1,...,12). This choice of c t and a- t 
gives a reasonable range of possible values for the 
standard uncertainties in this particular experiment as 
the declared standard uncertainties based on Table 1 
range from a minimum of about one third of the 
average standard uncertainty to a maximum of about 
twice the average. The fourth column contains 
the results based on the linear pool estimator. In this 
particular Key Comparison, the separate measurands 
model is possibly too extreme as there truly was a 
single measurand. However, the separate measurands 
model can be thought of as a limiting case of a 
systematic lab effects model and as such can be used 
to obtain an upper bound on the uncertainty of the 
KCRV 



It is clear from Table 2 that even though the values of 
the Bayesian KCRV estimates under the various 
assumptions are very similar, their uncertainties are 
not. Thus, it is clearly important to examine the 
assumptions underlying the various analyses and make 
sure that they are reasonable. Equation (9), without the 
restriction on y, quite objectively estimates the under- 
lying variability due to systematic lab effects. It allows 
for the uncertainty to be as low as that given in Eq.(l 1), 
when the data supports it. In this case, the data clearly 
indicates that the uncertainty is larger. The posterior 
mean of /is 1.9421E-4 with posterior standard devia- 
tion of 1.5838E-4. This gives a 95 % Highest Posterior 
Density Interval (HPD) of (8.2604E-6, 5.9033E-4). 
This is the shortest interval of possible values for y 
which has probability of 0.95. For comparison, the 
individual laboratories' Type B evaluated uncertainties 
ranged from 1.8974E-4 to 1.3153E-3. Note that these 
included terms based on random laboratory effects 
(estimated by f) as well as terms based on other factors. 
Thus on the whole, the stated Type B evaluated uncer- 
tainties are reasonable, and the lab-effects-model 
KCRV would make a good choice. 

The published analysis of CCAUVV-K1 used 
Eqs. (10) and (11) for the KCRV and its uncertainty. 
The analysis (referred to as the common mean model 
above) was described as "correctly reflecting the 
declared uncertainties of the individual laboratories." 
There is a discussion in the report, from a frequentist 
perspective, of statistical issues concerning the under- 
lying assumptions of this analysis. Because of concerns 
with the validity of these assumptions, other frequentist 
estimates of the KCRV, including the average y, the 
median of the y h and a Maximum Likelihood Estimator 
(Ref. [23]) were computed. The report concluded that 
the values of these estimators were similar enough to 
justify the use of the common mean model for the 
KCRV in this key comparison. The report did not 
explicitly show the uncertainties associated with the 
various estimators. 
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7. Conclusions 

Key Comparison experiments, performed in various 
sub-disciplines of physics and chemistry, pose numer- 
ous challenges to the analyst. Most importantly, Type B 
evaluated uncertainty must be included in the statistical 
model in a meaningful way, one that satisfies both the 
scientist and the statistician. Further, the scientific 
objectives of the experiment must be reflected in the 
statistical summaries and the results must be compliant 
with the Guide to the Expression of Uncertainty in 
Measurement. It is shown in this paper, that the 
Bayesian paradigm allows flexible modeling of Type B 
evaluated uncertainty and that it can produce estimates 
that are satisfactory to the needs of the scientists. 
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